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1 Introduction 

1.1 Motivating problem 

In natural populations, animals are likely to be infected by a variety of pathogens, 
either simultaneously or successively. Interactions between these pathogens, 
which can be synergistic or antagonistic, can affect infection biology (e.g. the 
intensity of one or both infections), host susceptibility to infection, or may 
impact on the host's morbidity or/and mortality. However, the biological pro- 
cesses involved are often too complex to allow clear-cut predictions regarding 
the outcome of such interactions. In order to explore potential interactions, a 
longitudinal study was undertaken by recording the sequences of infection events 
for different parasites in four spatially distinct populations of field voles {Micro- 
tus agrestis). The data are records of six pathogens: three species of Bartonella 
bacteria (B. taylorii, B. grahamii, B. doshiae), cowpox virus, the bacterium 
Anaplasma phagocytophilum and the protozoan Babesia microti. Aside from 
their intrinsic interest as a community of pathogens, Bartonella, Anaplasma, 
Babesia and cowpox virus infections may also be zoonotic: capable of being 
transmitted from animals to humans and causing disease. 

As in most capture-mark-recapture studies, a different set of voles was caught 
at each session leading to incomplete profiles for all subjects. The dataset there- 
fore contains many missing observations; for example a profile for a given vole 
and a given disease from the first to last observation times for that vole might 
be NPxxPNxP, where x, N and P respectively indicate a missing observation, 
a negative response and a positive response. Inference on incomplete data in 
longitudinal and capture-recapture studies is a major problem; for examples see 



Daniels and Hogan ( 2008 1 and Pradel ( 2005 1 . Previous analyses of our and 



related datasets (see Teller et al. (20101 and Begon et al. (2009)) have exam 



ined all pairs of observations for a given vole that occurred exactly one lunar 
month apart and for which the first of the two observations was an N . The 
influence of each covariate on the probability of contracting a disease is then 
ascertained through logistic regression. In this paper we offer a more realistic 
model and a more powerful analysis methodology for investigating the effects 
of previous infections for each disease on the other diseases. We use a hidden 



Markov model for each disease (Section 2.1) and perform inference via a Gibbs 



sampler; this allows us to use all of the dataset and to infer covariate effects 
on a given disease, even when these covariates are the (potentially missing or 
hidden) states of the other five diseases. 

1.2 Data 

We analyse data collected between March 2005 and March 2007 from field voles 
in Kielder Forest, a man-made forest on the English-Scottish border. The voles 
were trapped at four grassy clear-cut sites within the forest, with each site at 
least 3.5km from the nearest neighbouring site. Individuals were trapped within 
a 0.3ha live-trapping grid comprising 100 traps set at 5m intervals, with trapping 



taking place every 28 days from March to November, and every 56 days from 
November to March. iBegon et al. (2009) provides further details of the study 



area and the trapping design. 



Table 1: Description of variables in the data set and their possible outcomes 



Variable 


Description 




Tag 


Unique number that identifies each vole 




Site 


Identifier for the capture site (4 level factor) 




Sex 


Male/Female 




Lm 


Capture time point in whole lunar months (1 - 


- 27, integer) 


Weight 


Weight in grams rounded to the nearest 0.5g 




Sin 


sin(27rLm/13) 




Cos 


cos(27rLm/13) 




Tay 


B. taylorii, N (negative) or P (positive) 




Grah 


B. grahamii, N(negative) or P (positive) 




Dosh 


B. doshiae, N (negative) or P (positive) 




Cow 


Cowpox, N(negative) or P(positive) 




Ana 


Anaplasma, N(negative) or P(positive) 




Bab 


Babesia, N (negative) or P (positive) 





Captured voles were marked with a unique identifying passive transponder 
tag in order to be recognised in later captures. At each capture, a 20 — 30/iZ 
blood sample was taken for pathogen diagnostic tests. PCR assays were used to 
directly test for evidence of infection with Anaplasma phagocytophilum, Babesia 



microti and the three Bartonella spp. (see Courtney et al. 
[1( [2008| and |Telfer efoT 



(2008)). Antibodies to cowpox virus were detected by 



immunofluorescence assay (see 

the observed and derived variables is given in Table 111 



(2004), Bown et al. 



Chantrey et al. (1999)). A brief description of 



Table 2: Frequency of missing values per vole as a function of the number of 
lunar months from the first capture time to the last capture time. 



Lunar months from 
first to last capture 




Values missing 







1 


2 


3 


>4 





832 


- 


- 


- 


- 


1 


275 


- 


- 


- 


- 


2 


132 


74 


- 


- 


- 


3 


75 


55 


15 


- 


- 


4 


30 


49 


33 


9 


- 


5 


21 


24 


34 


5 


3 


6 


7 


7 


33 


15 


2 


7 


1 


4 


25 


16 


9 


>7 





2 


27 


12 


15 



After some processing (described in detail in Xifara (2012)) our dataset 



Table 3: Summary information for the six diseases; number of missing values 
despite the vole being captured and numbers of negative (N) and positive (P) 
responses. 



Disease 


# additional 
missing values 


#N 


#P 


B. doshiae 


46 


3583 


715 


B. grahamii 


44 


3468 


832 


B. taylorii 


32 


3139 


1173 


Babesia 





2354 


1990 


Cowpox 


85 


1408 


2851 


Anaplasma 


6 


4107 


231 



contains 4344 captures of 1841 voles. Only voles that have been caught at least 



twice are directly informative about transition probabilities (see Section 2.1) 



although voles that have been captured only once still contribute to inference 



for the initial distribution of each hidden Markov chain (see Section 3.1.1 ). 

The dataset contains a substantial fraction of missing data: almost half of 
the voles are not captured at every lunar month between the first and last times 
they were observed. Thus, even for many of the voles that were observed at 
least twice, not all of the covariates are available, either because the vole was 
not caught in a given lunar month, or sometimes because the vole was caught 
but a given variable was not ascertained. Table[2]shows the frequency of missing 
values derived from the first cause. The number of additional missing values, 
where it was not possible to ascertain the status of a particular disease, despite 
the vole being captured, is given in Table |3] This table also shows the frequency 
of positive (P) and negative (N) records for each disease. 

1.3 Statistical challenges 

We aim to investigate potential interactions between the six pathogens of the 
study. In particular, for each disease, d, we wish to evaluate the way in which 
the presence or absence of each of the other diseases (and perhaps further in- 
formation such as whether or not any infection is in its first month) affects 
the probability of a vole contracting d. Additionally where applicable we are 
interested in how other diseases affect the probability of recovery from d. 

We could model each disease as a two-state discrete-time Markov chain, 
where State 1 corresponds to no disease and State 2 to presence of disease; 
however, this two-state model imposes a very specific structure. For example 
the length of any infection is geometrically distributed; however it might be 
that the probability of remaining infected when a disease is in its first month 
(an acute phase) is different to that in subsequent months (chronic phase). It 



has also been found (e.g. Telfer et al. (2010)) that acute and chronic phases 



of a disease di can have different effects on the probability of a vole contracting 



disease d2- A two-state semi-Markov model (see, for example, Guedon (2003)) 



could account for the first effect, at the expense of extra complexity, but not 
the second. To adequately represent both the dynamics and influence of each 
disease with minimal extra complexity, therefore, in this analysis the dynamics 
of all but one of the diseases is modelled as a Markov chain with more than two 
states. Section [2?T] details the model for each disease. 

Only knowledge of the presence or absence of the disease is available to 
us. In general, this equates to knowledge of a subset of the state-space in 
which the true state must lie, but not to the exact state of the chain. For 
example, for all but one disease. States 2 and 3 both correspond to presence of 
the disease. In disease modelling. Hidden Markov Models (HMMs) arise when 
the Markov model for disease progression has a number of stages, or states, 



but these are not directly observed (e.g. Guihenneuc-Jouyaux et al. (2000) 



Chadeau-Hyam et al. (20101). Often the relationship between the state of 
the Markov chain and the observation is stochastic, although in our case there 
is no stochasticity involved, but the state of the Markov chain is nonetheless 
hidden. Furthermore, observations are only available to us when the vole has 



been captured. The forward-backward (FB) algorithm (see Section 2.3) can be 
applied to any discrete-time HMM with a finite state-space and addresses both 
of these issues. 

We consider D = 6 diseases, and hence six interacting (or coupled) HMMs. 
It is possible to consider the coupled Markov chains for each disease together as 
a single Markov chain on an extended state-space. In this case the likelihood 
function is straightforward to evaluate using the forward-backward algorithm 



(see e.g. Zucchini and MacDonald (2009)) and a Bayesian analysis can then be 



performed using MCMC. In our particular scenario the state-spaces have size 
4, 4, 4, 3, 3, 2, which would lead to an extended state-space of size 4^^ x 3^ x 2 = 
1152. Since the forward-backward algorithm applied to an HMM with n states 
takes 0{n^) operations, a naive implementation of the algorithm applied to the 
extended state-space would be 0(11522)/0(3x42 -1-2x3^-^22) w 0(18959) times 
less efficient computationally; equivalently, 100000 iterations of an algorithm 
which deals with each chain separately would take the same CPU time as 5 or 
6 iterations of the single-chain algorithm. In our specific scenario, but certainly 
not in generality, some of the transition probabilities in each individual chain are 
zero, and (in our scenario) only 32768 elements of the extended transition matrix 
would be non-zero. The use of sparse matrix routines could therefore reduce 
the efficiency ratio to approximately 468. Such a reduction in computational 
efficiency would only be justified if fraction of missing data were very close to 1 
so that the mixing of our Gibbs sampler would be extremely slow. 



Pradel (2005) analyses capture-recapture data using an HMM, and incor- 



poration of covariate information within this framework via an appropriate link 



function is straightforward (see |Lachish et al. (2011), Zucchini and MacDonald 
[](2009l (Section 8.5.2)). However the methodology does not allow the use of 



multiple HMMs nor, therefore, can it use the state of each HMM as a covariate 
for the other HMMs. We require six HMMs (one for each disease) and we wish 
to use covariate information such as the time of year and weight of the vole. 
Furthermore we wish the covariate set for each disease to include the states of 



the HMMs for the other diseases. For each disease, d, we will represent the 
probability of each possible state change through a logistic regression. However 
some of the covariates, the states of the other HMMs, are unknown. Our so- 
lution is a Gibbs sampler which employs the forward-backward algorithm and 
adaptive random walk Metropolis steps to jointly sample from the true posterior 
distribution of all of the HMMs and the covariate parameters. 

1.4 Outline 

The remainder of this paper is organised as follows. Section 2 describes the 
model which was used for each disease, gives its likelihood function, and outlines 
the imputation of missing weight values and the other fixed covariate values. 
The Markov chain Monte Carlo algorithm is described in Section 3 and we 
present our results, including the sensitivity study, in Section 4. The paper 
concludes with a discussion. 

2 Modelling the hidden and missing data 

2.1 Hidden Markov models and notation 

Hidden Markov models (HMM) are used when observations are influenced by a 
Markov process but the state of the Markov process itself cannot be determined 
exactly from the observations. Usually the relationship between the Markov 
process and the observation process is stochastic, but (as in our application) 
this need not be the case. For various examples and applications of HMMs see. 



for example, Zucchini and MacDonald [(2009). Purely to simplify our subscript 



notation we consider each vole to have been first observed at a local (to the 
vole) time of 1 and last observed at (local) time T. For disease d {d = I, . . . ,D), 
let the state space for the Markov chain be S^'^l and the state space for the 
observation process be Y := {N,P}. For a given vole and for disease d, the 
unobserved Markov chain and the observations are respectively 



XM :. 
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[d] [d] [d 

1 1^2 ' • ■ • i^T 
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Yf ' is conditionally independent 


oiX^f,...,xl'\ 


^[d\ 


^[d\ 




given 


xr 


The observed 
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t,i ^ 




if 


y\. — missing. 









This vector is defined for each disease in Section [2751 

We take the discrete time interval of each Markov chain to be one lunar 
month. Since trapping sessions in winter took place every two lunar months (see 



Section 1.2 1 this inevitably leads to missing observations for any vole caught 



several times over the winter, even if it is caught at every trapping session. 



For each unknown transition probability (see Section 2.5) we have a logistic 



[d] 



regression model; for example the probability, pii n that a given vole will be in 



state 1 (disease d absent) at time t - 
given by 



1 given that it is in state 1 at time t is 



,ld] 



Ml. 



Here z\ ' is the vector of covariates at time t, which for all models includes the 
states of the other diseases at time t, Xj , as well as a deterministic covariate 
vector, Zj . This deterministic vector was chosen via forward fitting of logistic 



regression models that were very similar to those of Telfer et al. (2010) and 



|Bcgon et al. \ (|2009|) (see Section [lTJ). However, whereas [Telfer et al \ ( |2010 1 



and Bcgon et al. (2009) allow both the current covariates and covariates one 



lunar month into the future to influence the response one lunar month into the 
future, we only allow the current covariates to influence the future response; 



further details are available in Xifara (2012). For all diseases the deterministic 



vector consists of a time trend (lunar month, Lm, as a continuous covariate) , a 
seasonal cycle in the form of sin and cos, and sex, weight, and site. The 
covariate vector for cowpox also includes a different trend with lunar month for 
each site, and for all other diseases it allows for a different seasonal cycle for 
each sex (see Table [I] for detailed covariate descriptions). 

We denote the transition probability matrix from time t to t + 1 for disease d 



by Pl'^l ( (3 



j[rf] J-d\ Jd\ 



P, 



(/si'^U! 



-d\ \d\ 



Prix. 



.[d\ 



--]\x.^ 



[d\ 



and let the initial distribution for the hidden chain be ■k^'^K Figure [ij depicts a 
simplification of our scenario, where there are just two diseases. Note that the 



states of all chains at time i + 1, X, 
on the states of all chains at time t. 



[1] 
t+i' 



X\j/[, are independent conditional 



2.2 Likelihood function 

We now provide full detail of the likelihood for a given vole. The likelihood 
for the data is simply the product of these likelihoods over all 1841 voles. 

Let /3li^^l := {(3^^\ (P\ . . . , (3^""^) , y^^^l = (yW.yl^l, . . . ,y[^l) and xH^^l = 
(x[-'^l,x[^l, . . . ,x[-'^l). The conditional independence structure leads to a com- 
plete data likelihood of 

L(yi-i xi.-i;3'-i,.--i) - n-ifvii/n '-;?.,.«. (/"''."S-"-!")'!:,.; 

d=i t=i * ■ '+1 

The observed data likelihood for the vole is then 

i(^y[l:I5]. ^[1:13] ^^[1:1^]^^ ^ L (y [l^^l , x''^""' ; /3''^''U''^''1 ) , (1) 

where §* := §[^1 x • • • x §[-^1. For a single chain the summation over the hidden 
states can be written as a matrix product; this simplification is not possible for 



[d]- 




Figure 1: Directed graph of a realisation of two parallel hidden Markov mod- 
els; where yl , t 



1, 



, , 5, where present, are the observed values of disease 
d {d = 1, 2) and X^ , t = 1, . . . , 5, are the states of the hidden Markov chain 
for d, arising from (unknown) initial distribution ttI'^I. The nodes from /3' ' 
reflect dependence of the transition matrix on the (unknown) covariate param- 
eters. To simplify the presentation and focus on the coupled HMM we omit the 



deterministic covariate vectors z 



'[d] 



coupled chains as the transition matrix for each disease depends on the state of 
each of the other diseases. Our Bayesian analv sis r equires prior distributions for 
f3^^' ' and tt^^'-^^ which are detailed in Section 3.1 The product of the observed 
data likelihoods over all voles multiplied by the prior distribution for /3' ' ' and 
ttI^-^I gives, up to a constant of proportionality, the joint posterior for (3^ ' ', 
ttIi^^I andxti^^l. 



2.3 The forward-backward algorithm 

The forward-backward algorithm developed by Baum et al. (1970) (see also 



Zucchini and MacDonald (2009), Scott (2002), Rabiner (1989) and Chib 



(1996), for example) may be applied to any discrete-time hidden Markov model 
with a finite state-space, and provides us with two useful tools. The forward 
recursion is a computationally efficient algorithm for calculating the likelihood of 
the observed data, while the backward recursion provides us with the distribution 



of each hidden state, Xt, given the state at the next time point, Xt+i, and all of 
the observations. Both will form part of the Gibbs sampling scheme that will 
be described in detail in Section [SiSl 



2.4 Other missing covariates 



As mentioned in Scction[L3j for many voles not all of the covariates are available. 
For a given vole, the covariates sex, site, and Lm clearly carry over to the 
missing records. The unobserved disease states will be treated dynamically and 
will be sampled from the conditional distribution as part of the Gibbs sampling 



scheme (see Section 3.2 1. Such sampling could perhaps also be performed for 
weight. However here we adopt a simpler approach whereby each missing weight 
value is imputed once via linear interpolation between the two nearest observed 
values for that vole. The robustness of inference to other sensible imputed 



weight values obtained by using a growth model is investigated in Section 4.3 



2.5 Details of the Markov models for individual diseases 

The remainder of this section gives a brief description of each disease in the 
study and describes the hidden Markov model that is used to model it. All 
transition probabilities are time dependent since some of the covariates are 
time-dependent; however for ease of notation we drop any explicit reference to 
this time dependence. A more detailed description of the host resources required 
by these parasites and a discussion about host immune responses can be found 
in 



Telfer ei a/. (2008) 



2.5.1 Bartonella species 

Bartonella is a genus of bacteria that infects mammals (including humans) , usu- 
ally transmitted by arthropods. The species investigated here are transmitted 
by fleas (Bown et al. |(2004[)). We assume that the effect of other diseases and 



covariates on the probability that a vole will recover from a particular Bartonella 
species after the second (third fourth etc.) lunar month is the same as for the 
effect on the probability of recovery after the first month; there are no grounds 
for assuming otherwise. However, since the majority of Bartonella infections 



last for one month and only a few last more than two (Birtles et al. (2001) 



Telfer et al. (2008)) the overall probabilities of recovering after the first and 
second month are likely to be different. Additionally a vole's chance of contract- 
ing a particular Bartonella species for the first time might be different from the 
chance of contracting it again after recovery from it in the past, although again, 
there is no reason to assume that the effects of other diseases and covariates on 
this are likely to be different. This suggests that each Bartonella species could 
be sensibly modelled using a Markov chain with four states: l=no infection, 
2=new infection, 3=old infection, 4=uninfected but has had a past infection. 
However, the observed sequence indicates either negative (N) or positive (P) 
status. In particular, an observation oi y[ ' =: N corresponds to hidden process 



of Xf^ = 1 or Xf = 4 with likelihood vector \^f^ 
tion oi y\ — P corresponds to X\ —2 or X\ = 



= (1,0,0, 1), and an observa- 
3 withl[''l = (0,1,1,0). The 
1 for 



tinie-inhomogeneous transition probability matrix from time t to time t - 
this Markov chain is 



P = 



1 - Pl2 Pl2 











1 -P24 


P24 





1 -P34 


P34 


P42 





1 - P42 


ibility is governed by a Ic 


)gistic regr( 


logit{pi2) = 


/3oa2 + z 


h' contract 1 


logit{p2i) = 


/3o,24 + z 


^0 

r^ recover "> 


logit{p3.i) = 


/3o,34 + z 


^/3 

H recover") 


logit{p42) = 


/3o,42 + Z 


/^contract' 



(2) 



(3) 
(4) 
(5) 
(6) 

As justified above, we use the same vector of covariate effects (^contract ^'^^ 
the two probabilities related to contracting the particular Bartonella species. 
Similarly we use the same covariate effects for the two probabilities relating 
to recovery from the disease, f3rf,^g^grj ^^ allow only the intercepts to differ. 
This assumption prevents a further increase in the, already large, number of 
parameters to be estimated. For example, the logistic model for the probability 
of contracting B. taylorii for the first time at lunar month Lm + 1 will be 



logit(Pi2'') 



A), 12 + Pwt^eight + PLm'Lm + Psexl{male) + /3s2/(site2) + A3J(sitc3) + /?s4/(site4) 
+I3sin sin +l3cos cos +I3sex:3inl{male) sin +/3sei::cos/(maIe) cos +^j,rah2/(grah2) 
+/3sr<i?i3-f (grahS) + /?j,rah4/(grah4) + P do sh2 1 {dosh2) + fidoshsl (doshS) 
+/3dosh4-^(dosh4) + /?6a62/(bab2) + /?i,a637(bab3) + /3com2-^(cow2) + /3co™3-^(cow3) 
+/3ana2/(ana2). 

Here and elsewhere /(•) denotes the indicator function, and [diseaseja; is a 
statement that the hidden chain for [disease] is in state x. 



2.5.2 Babesia 

Babesia microti can cause haemolytic anaemia in infected hosts. It is a chronic 
infection, which is to say that once a host is infected it is never again free of the 
disease. The effect of a Babesia infection on the probabilities of contracting or 
recovering from one of the other diseases may depend on whether the Babesia 
infection is acute (in its first month) or chronic. 

We therefore model Babesia using a Markov chain with the following three 
states: l=no infection, 2=acute infection, 3=chronic infection. Here the likeli- 
hood vector that connects the states with the observations is analogous to that 
for Bartonella species but ignoring state 4. The transition matrix is 



P = 



1-P12 Pl2 

1 

1 



10 



As in the previous section, a logistic regression relates pi2 to the covariates, 
including the states of the other diseases. 

2.5.3 Anaplasma 

Anaplasma phagocytophilum is a tick-borne bacterium that causes the disease 
granulocytic ehrlichiosis in humans. In the dataset there are relatively few 
positive records for Anaplasma and thus little power to ascertain transition 
probabilities and covariate effects from a third state of, for example, "currently 
uninfected but was previously infected" . Therefore, we use a two-state Markov 
chain with the following transition probability matrix 



P = 



I-P12 P12 
P21 1 - P21 



with separate logistic regressions relate pi2 and P21 to the covariates, including 
the states of the other diseases. This therefore is the only disease for which the 
underlying Markov model is not hidden. 

2.5.4 Cowpox 

In voles and other wild rodents, infection with cowpox virus is known to last 
for approximately 4 weeks (Bennett et al. (1997[)). The diagnostic test, how- 



ever, detects antibodies to the virus, not the virus itself. Antibodies appear 
approximately 2 weeks after contracting the infection but then remain present 



in the blood stream of a vole for the rest of its life (Bennett et al. (1997)). 
Since the disease lasts for approximately one month we model the progression 
as a Markov chain with three states: l=antibodies absent and disease absent, 
2=antibodies present and disease present, 3=antibodies present and disease ab- 
sent. Therefore, the form of the transition matrix and the relationship between 
the states and the response is identical to that for Babesia. The difference is in 
the interpretation: here State 3 corresponds to a positive response but absence 
of the disease, whereas for Babesia State 3 corresponds to a positive response 
which means that the disease is present. 

3 Bayesian approach 

3.1 Choice of prior 

3.1.1 Initial probability distribution 



The likelihood (Section 2.2 1 and the forward-backward algorithm (Section 2.31 
require, for each disease, the initial distribution, tt^'^I, of the Markov chain on 
the set of states for that disease at the first observation time for each vole. Our 
time-inhomogeneous Markov chains admit no limiting distribution and so the 
popular choice of setting the initial distribution to the limiting distribution of 
the chain is not available to us. 



11 



We choose then to estimate this distribution for each disease through our 
Gibbs sampler in Section |3.2[ We choose independent and relatively vague 
Dirichlet priors, Tr^'^l ^ Dir{a.^'^), with a^'^^ set to a vector of ones with length 
equal to the cardinality of state space S^'*!. This is equivalent to a uniform prior 
on each tt^'^^. 

3.1.2 Prior distributions for the regression parameters 

A similar longitudinal dataset to the one that we analyse was also available to 
us. This additional dataset arises from an earlier, three year study which was 
conducted using the same sampling design, but where the response for Bar- 
tonella was a single indicator for presence and absence, rather than an indicator 
for each species. For each of Babesia, Anaplasma, and cowpox we were therefore 
able to fit a logistic regression to a subset of the additional dataset as briefiy 
described in Sections |1.1| and |2.1[ except that the three indicator covariates 
for presence or absence of each Bartonella species were replaced with a single 
indicator covariate for presence or absence of at least one Bartonella species. 
Parameter estimates from these analyses were used to inform our choice of prior 
for similar parameters in our main analysis. 

Since the additional dataset does not distinguish the Bartonella species, 
there is not an exact correspondence between parameters from the simple anal- 
yses and the parameters in our main model, and some of the parameters in our 
main model have no counterpart in the simple analyses. The priors for each /3' ' 
{d = 1, . . . , D) in our main analysis are independent and Gaussian with covari- 
ance matrices that we denote by V'-'^'. For Babesia, Anaplasma, and cowpox, 
where parameters do approximately correspond, we set the prior mean for the 
parameter in our main analysis to the MLE for the corresponding parameter in 
the simple analysis of the additional dataset and the block of V^'^ associated 
with these parameters to 9 times the analogous block from the simple analysis. 
Where no corresponding MLE exists, the prior mean was set equal to zero and 
the block of V^"^ was a diagonal matrix where the diagonal elements were set 
to nine. 

3.2 Adaptive Metropolis-within-Gibbs algorithm 

In our dataset, the target parameter, (3, can be naturally partitioned into six 
sub-blocks, one for each disease. In the Gibbs scheme we wish to update the co- 
variate parameters for each given disease rf, /3' , using a random walk Metropolis 



step (RWM) (see e.g. Gilks et al. (19961); however the efficiency of a given 



RWM algorithm d epends heavily on the cho i ce of the v ariance of the proposal 



jump, rl*^!. Both Roberts and Rosenthal (2001) and 



Sherlock and Roberts 



I] ( 2009 1 suggest that a RWM algorithm might achieve near optimal efficiency 



when X'l'^1 correctly represents the general shape of the target distribution, for 
example if it is proportional to the variance of /3' ' or the inverse curvature at 
the mode. We therefore generalise the adaptive RWM algorithm described in 
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Sherlock et al. [ (2010) to an adaptive Metropolis-within-Gibbs algorithm on D 
sub-blocks. 

Let f3^^'-°^ = (/SW,/?!^!,...,/?'-"!), where f3^'^^ is the parameter set for the 
dth sub-block. A single iteration starts from an initial value /3' ' ', cycles 
through all the sub-blocks updating each in turn, and finishes with /3''^' ' = 
(/3''^1 , /3''^1 , . . . ,(3'^^'). In the update for the d*'' block the current and proposed 
values are respectively: 

/3[i^^l := /3'W,...,/3'['^-il,/3[''l,/3[''+il,...,/3f^l, 
^*li:D] .^ /3'W,...,/3'M-il,/3Ml+e,/3M+il,...,/3[^l. (7) 

Here the proposal jump e in (l7| for the dth sub-block at the nth iteration is 
sampled from a mixture distribution: 

N (0, f mlfH Sl^'^ j with probability 1 - 6, 
AT ( 0, fmjfn Z'(5''M with probability ,5. 

Here J is a small positive constant, which we set to 0.1, Sg is a fixed covariance 



matrix and rric is set to the theoretically derived value 2.38/fc^/^ (Roberts and 
(2001)), where k is the dimension of ,9' '. The matrix Sn is the 



Rosenthal 



M 



estimated variance of /3' ' using the sample from the Markov ch ain to date. The 
scaling factor to„ for the adaptive part is initialised to mp . 



Sherlock et al. 



(2010) details the adaptation of the scaling factor at each iteration, which leads 



to an equilibrium acceptance rate of 25%, close to the optimal acceptance rate 



(approximately 23%) derived by Roberts and Rosenthal (2001). 

The conditional likelihood for /3' ' \y^'^ , x^~'^ , tt^'^^ is calculated using the for- 
ward part of the forward-backward algorithm for disease d. This provides the 
conditional posterior for /3^ ' and hence the acceptance probability for the pro- 
posed value of ,9' '. The states of the hidden Markov chain for disease d can 
then be sampled from their conditional distribution given the observed data for 
that disease, ttI'^I, f3^ ', and the states of the other diseases using the backwards 
part of the forward-backward algorithm. 

Given the states of the hidden chain for a particular disease, d, and in par- 
ticular, the initial state for each vole, the conditional conjugacy of the Dirichlet 
distribution allows straightforward sampling from the conditional posterior for 

We therefore simulate from the joint posterior distribution of the coefficients 
of the logistic regressions for the transition probabilities, the hidden disease 
states and the initial probability distribution of the hidden states with the fol- 
lowing MCMC algorithm. 

At the start of the current iteration of the chain let the covariate parameters 
be /3' ' , let the hidden states be X^^-^l and the initial distribution of the 
hidden states be ttI^-^I; denote their values at the start of the next iteration as 
f^'li:D]^ ^,[1:0] a^d ^,[i:D] respectively 
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Each step of the Gibbs sampler is as fohows. 

• Perform an adaptive RWM update according to /3''^' |y[-^l , /3'^' , xl^'^1 , ttI^I . 

• Simulate the hidden states for the first disease from X't^l |y I^l , /3'[^1 , xl^^-^l , Trl^l . 

• Simulate the initial probability distribution of the chain for the first disease 

7r'W|x'W. 

• Perform an adaptive RWM update according to /3'[^1 jyl^l , (3^'^'^ , x'l^l , xP--D1 , ttI^] . 

• Simulate the hidden states for the second disease from X'l^l |y[^l , /3''^1 , x'l^l , xl'^'-^l , ttI^I . 

• Simulate 7r'[2l|x'[2l. 

• . . . 

• Perform an adaptive RWM update according to /3'[^1 |y [-"l , /3[^1 , y^\-'^-D-n ^ ttI^I . 
. Simulate X'l^I |y 1^1 , /3'[^1 , x'[i^^-il , ttI^I 

• Simulate 7r'[^l|x'[^l. 

The adaptive RWM step requires the fixed covariance matrix Eg ' . For each 
disease, a separate non-adaptive RWM was performed for the logistic regressions 
coefficients associated with the hidden Markov model for this disease that are 
not associated with the other diseases; for example weight and sex. The block 
of Eg ' associated with these covariates was estimated directly from this run. 
Each of the remaining /3 coefficients was given a small proposal variance and was 
assumed to be uncorrected with any of the other coefficients. Also to ensure 
a sensible non-singular Sn , for each disease, proposals from the adaptive part 
were only allowed once at least 1000 proposed jumps had been accepted. 

4 Analysis and results 

4.1 Convergence of the algorithm and model diagnostics 

All the computationally intensive parts of the algorithm were coded in C within 



an R (R Core Team (2012)) wrapper. On a computer with an Intel Nehalem 
2.26 GHz GPU, 100,000 iterations of the algorithm took approximately 3 hours. 
Three independent Markov chains of length 350,000 were generated from the 
algorithm in Section |3.2| each chain was started from a different position. Six 
of the 233 trace plots from one of the chains are reproduced in Figure [2J Most 
of these, over the first few tens of thousands of the iterations the variance of the 
proposal increases as the adaptive algorithm learns the shape of the posterior; 
this was the case in many of the 233 trace plots. 



The Gelman- Rubin statistic (Gelman and Rubin (1992)), R, was calculated 



from the three chains for each of the 233 components oi j3. Figure |3] shows the 
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mean of the estimated R statistics, the maximum of the estimated R statistics 
along with maximum of the 97.5% quantiles of R statistics, plotted against 
iteration number. The plot suggests that a burn-in period of 150,000 iterations 
should be more than sufficient. Inference is therefore performed using the final 
200, 000 iterations from each of the 3 runs combined. 

To assess model fit we examine the posterior predictive distribution of the 



data (see Robert, Ryden and Titterington (19991). We chose, at random, 100 



captures where all six diseases were observed and created an alternative dataset 
where all diseases for these captures were marked as missing. We refitted the 
model and estimated the posterior probabilities, p, of these artificially removed 



observations being positive. A Hosmer-Lemeshow test (e.g. Collett (2003)) 
for each disease provides a p-value for the null hypothesis that each of the 
true (binary) observations arises from a Bernoulli trial with the given posterior 
probability. For the six diseases we obtained p-values of 0.443 [B. doshiae), 
0.188 {B. grahamii), 0.061 {B. taylorii), 0.141 {Babesia), 0.56 (Anaplasma) and 
0.322 (cowpox). 

4.2 Posterior inference 

We are interested in interactions between diseases, for example in whether or 
not presence or absence of disease di affects the probability of a change of state 
for disease d2. In each logistic regression for each transition matrix, we therefore 
examine the coefficients that correspond to the states of the other diseases. We 
are also interested (for Bartonella) in whether or not the status of an infection 
(new or old) affects the chance of recovery, and in whether or not a previous 
infection affects the chance of a new infection with the same species; these 
correspond respectively to the contrasts: /3o,34 — /3o,24 and /3o,42 — /3o,i2- 



Formal model choice, for example via reversible jump MCMC ( Green ( 1995 1), 
is computationally infeasible here. Instead we take a high posterior probabil- 
ity that a given parameter or contrast is positive (or a high probability that 
it is negative) as indicating a potentially important effect. For an individual 
parameter we might consider P(positive) > 0.975 or P(positive) < 0.025 as 
indicating a likely effect. We are however interested in a total of 116 parame- 
ters and contrasts which raises a similar problem to that of multiple testing in 
classical statistics. Whilst considering probabilities below 0.025 or above 0.975 
to indicate a possible interaction, we therefore take probabilities below 0.00025 
and above 0.99975 as indicating a very probable interaction. 

Table|4]shows those parameters for which the posterior probability of positiv- 
ity is either above 0.975 or below 0.025. The table shows the posterior median, 
a 95% credibility interval, and the posterior probability that the parameter is 
positive. 

Firstly, and most clearly, the presence of Babesia decreases the probability of 
contracting Bartonella and increases the probability of recovery from Bartonella. 
This is true for both chronic and acute Babesia infections and for all three species 
of Bartonella. There is no evidence for the reverse interaction, that is for the 
presence of Bartonella affecting the chance of contracting Babesia. 
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Gelman-Rubin Statistic 



\ • 



mean GR point estimators 

■ — max GR point estimators 

■ ■ ■ max 97.5% quantile estimates 




last iteration in chain 



Figure 3: Combined Gelman-Rubin statistics for all 233 /3's; i?i, . . . ,i?233- 

2g3X)j-Ri( — ); maxi Ri{ ); max^ i?i^o.975(- - -)■ Statistics are plotted on 

the log scale against iteration number, along with the ideal ratio of log(l) ( — ) 
and the threshold of log(1.2) ( — ) suggested in Gelman (1996). 



For two of the three Bartonella species (B. taylorii and B. grahamii) it 
appears that a vole is less likely to be re-infected following previous exposure 
while it is more likely to recover from an old infection of B. taylorii than a new 
one. Furthermore, a vole that has recovered from a B. taylorii infection is less 
likely to contract B. grahamii. In addition, there seems to be a decrease in the 
probability of contracting B. doshiae when a vole has been exposed to B. taylorii 
whether or not it is still infected. Finally, infection with Anaplasma appears to 
increase the probability of recovery from B. grahamii; there was perhaps some 
evidence for the same interaction with B. doshiae with posterior probability 
0.9593. There is no evidence of a change to the probability of recovery from B. 
taylorii (probability= 0.416). It is also possible that a current infection with 
cowpox virus hinders recovery from B. doshiae and previous exposure to cowpox 
prevents infection with Babesia. 

4.3 Sensitivity analysis 

Three somewhat arbitrary choices were made in the set up of our model and 
priors: the interpolation scheme that fills in missing weight values, the prior 
for the initial distribution for the state of the hidden Markov model for each 
disease and vole, and the exact relationship between parameters estimated in 
the simple analysis of the alternative dataset and priors for parameters in the 
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hidden Markov models for the main dataset. 

An alternative for each of these choices is described below. For each al- 
ternative three further chains of length 350,000 were created and checked for 
convergence. Then any sizeable changes in the conclusions that would be drawn 
from the posterior distributions of the parameters were noted. 

In the main analysis, missing weight values were filled in via linear inter- 
polation. As an alternative we considered the logistic growth curve which was 



proposed in Burthe et al. (2009). We assumed Gaussian residuals for the 
logarithm of weight and allowed the logistic growth parameters to depend on 
covariates such as the sex of the vole and the time of year; some of the coeffi- 
cients were also allowed to include subject specific random effects. More details 



are provided in Xifara (2012) 



The initial distributions for the states of the hidden Markov models for the 
diseases are assigned independent Dirichlet priors with the parameter for each 
disease, a\'^\ a vector of ones. As an alternative prior we set each al'^I to be a 
vector of twos. 

In the main analysis, where there was a rough correspondence between a 
parameter in the simple analysis of the alternative dataset and one in the main 
analysis, we centered the Gaussian prior in the main analysis on the maximum 
likelihood estimate from the simple analysis and set the covariance matrix to 
be nine times the estimated covariance matrix from the simple analysis (Sec- 
tion 3.1.2). As an alternative we use vague but proper Gaussian priors for all 
parameters. 

Parameter estimates with the alternative weight scheme or with the alter- 
native prior distribution of the hidden states were very similar to the estimates 
from the main set-up. For all the significant covariates none of the posterior 
probabilities changed by more than 0.005. However, the use of vague priors for 
the parameters noticeably affected one of the twenty one covariates in Table 
|4] The effect on a vole's probability of contracting Babesia when the vole had 
been exposed to cowpox became apparently unimportant, with posterior proba- 
bility changing from 0.02 to 0.093. No additional covariates became potentially 
important (i.e. p < 0.025 or p > 0.975) in any of the three alternative runs. 

5 Discussion 

We have described a coupled discrete-time hidden Markov model for interactions 
between diseases in a host and used it to analyse data from a longitudinal study 
of field voles with records of six different pathogens. The Markov model offers a 
more detailed description than the existing modelling approach that is described 
in Section [l.l I Furthermore, by explicitly dealing with the missing observations 
(which comprise approximately 50% of the dataset), the inference methodology 
that we introduce is able to use more of the data than the existing standard 
inference methodology. 

Inference is performed via a Metropolis- within- Gibbs sampler that cycles 
through the diseases and, for each disease conditional on the hidden states of 
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Table 4: Posterior suramaries of model parameters of interest for which the 
posterior probability of positivity is either above 0.975 or below 0.025. For each 
parameter the posterior median, a 95% credibility interval and the probability 
that the parameter is greater than zero is provided. Each parameter arises 
from a logistic regression coefficient for a particular transition probability in 
the hidden Markov model for a particular disease. The disease and transition 
appear in the first column, and second column indicates the particular disease 
and state that is influencing the transition probability. State f is always taken 
to be the baseline. The contrasts /3o,42 ~ /3o.i2 ^nd /3o,34 ~ /5o,24 are defined in 
Section |42] 



Transition 


Covariate 


Median 


95% CI 


Posterior 


Probability 








Probability 






tay2 


-f.0f24 


(-f.8708, -0.f554) 


0.0111 






tay3 


-f.f077 


(-2.0584, -0.f877) 


0.0096 


„dosh 
Pl2 




tay 4 


-f.0304 


(-1.8046, -0.2278) 


0.0077 






bab2 


-f.5636 


(-2.4247, -0.8039) 


0.0001 






babS 


-f.7702 


(-2.3095, -1.3235) 


0.0000 






bab2 


2.96f3 


(1.6459, 4.5298) 


1.0000 


^dosh 




bab3 


3.3386 


(1.775, 5.5524) 


1.0000 


P2i 




cow2 


-f.Of 


(-2.0937, -0.0691) 


0.0171 


qrah 




bab2 


-f.2579 


(-2.0506, -0.5881) 


0.0000 


Pl2 




bab3 


-2.0294 


(-2.6841, -1.5096) 


0.0000 






bab2 


2.78f8 


(1.1981, 5.0649) 


0.9999 


qrah 
P24 




babS 


f.7288 


(0.8899, 2.754) 


1.0000 






ana2 


f.0284 


(0.0116, 2.1709) 


0.9764 


to-y 




bab2 


-f.6933 


(-2.6563, -0.815) 


0.0001 


Pl2 




bab3 


-f.3346 


(-1.8338, -0.8754) 


0.0000 






bab2 


f.5824 


(0.702, 2.5404) 


0.9997 


tay 
P2i 




babS 


f.75f3 


(1.0015, 2.6279) 


1.0000 


„bab 
Pl2 




cow3 


-0.4939 


(-0.9799, -0.0206) 


0.02 


Contrasts 










agrah. 
h'0,42 


agrah 
P0,12 




-f.3248 


(-5.6452, -0.2774) 


0.0076 


ntay 
P0,42 


atay 
P0,12 




-4.472 


(-8.6604, -2.1214) 


0.0015 


ntay 
P0,34 


atay 
P0,24 




f.f762 


(0.555, 1.8601) 


0.9998 



all of the other diseases, samples from the parameters of the logistic regressions 
for the transition probabilities of the hidden Markov model using an adaptive 
random walk Metropolis step and then from the exact distribution of the hidden 
states given these parameters. These two steps use respectively the forwards 
and backwards parts of the forward-backward algorithm (FB). 

The FB Gibbs sampler (e.g. |Chib | ([T996|, [Scott | ([2002| and [Fearnhead 



and Sherlock (2006)) also uses the forward-backward algorithm; however the 



motivation is different. The FB Gibbs sampler does not use the likelihood from 



19 



the forwards recursion directly, as this would require a Metropolis-Hasting (MH) 
update; instead the backwards recursion provides a sample from the posterior 
distribution of the hidden states given the parameters. Due to the conditional 
conjugacy structure of the problems targeted by the FB Gibbs sampler it is then 
possible to sample exactly from the conditional posterior for the parameters 
given the hidden states and thus avoid the MH step and the associated tuning. 
Our logistic regression model for the transition probabilities does not allow 
a simple Gibbs step for updating the parameters conditional on the hidden 
states, and so we content ourselves with a MH step for the parameters and, 
for efficiency of mixing, do not condition on the hidden states for the current 
disease. After the MH step we then sample from the hidden states for the 
disease so that these can be used as covariates for the other diseases; in effect, 
we therefore sample from the joint conditional distribution of the parameters and 
the hidden states for the disease. The FB algorithm could be avoided entirely 
by updating the logistic regression parameters conditional on the hidden states 
for all diseases, and by sampling from the distribution of each individual hidden 
state conditional on all of the other hidden states and the transition parameters 
(e.g. [Robert, Celeux and Diebolt | (|1993l), [Robert and Titterington | (|1998[)). 



However we believe that the correlation between hidden states and between 
these states and the parameters would have led to a very poorly mixing MCMC 



chain; Scott (2002) discusses the first aspect of this. 



Brand (19971, Saul and Jordan (1999), Rezek, Sykacek and Roberts (2000) 



Zhong and Ghosh ( 2002[ ) aiidNatarajan and Nevatia (2007) all examine infer- 
ence for coupled HMMs. The DAG for the HMMs considered in these articles 
is the same as in Figure [l] however in order to allow recursions similar to those 
in the forward-backward algorithm all of these articles - except [Rezek, Sykacek[ 
and Roberts ( 2000 ) - make the simplifying assumption that the transition prob- 



ability for a given chain conditional on the others is separable: 



P 



(^t+il 



[l-.D] 



oc 



Y[k,{xU4^), 



for some collection of non-negative functions fid- Moreover the computational 
complexity increases with the square of the sum of the number of states in each 
chain, in practice each article considers only two chains. We wish to apply 
logistic regression rather than assume separability and to consider six chains; 
furthermore computational complexity of our algorithm increases with the sum 
of the squares of the number of states in each chain. [Rezek, Sykacek and[ 
Roberts (2000) perform Bayesian inference via a Gibbs sampling algorithm 



that is qualitatively similar to our own; the fully conjugate structure that is 
used does not, however, allow for the effects of any covariates. 

We now examine the most major findings of Section [4.2[ and briefly discuss 
the biological insights that they offer. For B. taylorii, voles are more likely to 
recover from an old infection than a new one, which is to be expected given that 
more complete histories for individuals indicate that most infections last for 
only one month. Previous data also indicate that B. doshiae infections may last 
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longer than B. taylorii and B. grahamii infections (Telfer et al. (2007)). Also, 
for B. grahamii and B. taylorii, previous infection by the species appears to 
grant some form of immunity to that species, suggesting that hosts can develop 
an effective acquired immune response. To date, there has been conflicting 
evidence for such a response in wild populations, suggesting immune responses 



may vary between host species and/or Bartonella species (Birtles et al. (2001 ) 



Kosoy et al. (2004), Bai et al. (2011 1). Interestingly also B. taylorii iniection 



appears to provide immune cross-protection to B. doshiae infection. 

We found, for voles currently infected with Babesia, both a reduction in 
susceptibility to Bartonella and an increase in the probability of recovery from 
Bartonella over the next lunar month. We also found no evidence that a cur- 
rent Bartonella infection might influence susceptibility to Babesia over the next 



month. Telfer et al. (2010) find the Babesia covariates, both at time <o and ii 



to be significant for predicting the probability of catching Bartonella between 
io and ii . The Bartonella covariate at time ii is also found to be significant in 
predicting susceptibility to Babesia, apparently contradicting our findings. How- 
as mentioned in Section 2.1 Telfer et al. (2010) allow both the current 



ever 

(io) and future (ii) state of each disease covariate to influence the probability 
that, for the disease that is being treated as a response, a vole is positive at time 
ii given that it is negative at time t^. An assumption of only the negative ef- 
fects of Babesia on Bartonella infections apparent from inference for our HMM, 
and no other dependency between Bartonella and Babesia, is sufficient to lead 
to a negative correlation between Bartonella and Babesia at any given time. 
Consider two groups of voles, those with Babesia at io (Group A) and those 
without Babesia at to (Group B). Since Babesia is a chronic infection. Group A 
voles are more likely to have Babesia at ii than voles from Group B. However 
since Babesia impacts negatively on the probability of a vole having Bartonella, 
Group A voles are less likely to have Bartonella at ti than voles from Group 
B. Since the effect of Babesia on Bartonella is so pronounced (Table |4|, it is 
certainly believable that this negative correlation could be strong enough that 
each of the two diseases at ti appears as an important covariate for the other. 
In the application which we have considered, missingness was believed to 
be independent of disease state; in other scenarios, such as those considered 
in Pradel ( 2005 1 the probability that a given subject will be observed might 
depend on the states of each of the hidden Markov models. This could be 
accommodated within our methodology through a further logistic regression 
for the probability of being observed given the set of hidden states and other 
covariate information, and several other minor changes as detailed in [Pradel | 
(|2005|. 
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